Regional plot¶
In [1]:
Copied!
import sys
sys.path.insert(0,"/home/yunye/gwaslab/gwaslab/src")
import gwaslab as gl
import sys
sys.path.insert(0,"/home/yunye/gwaslab/gwaslab/src")
import gwaslab as gl
In [2]:
Copied!
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
--2022-10-20 22:22:11-- http://jenger.riken.jp/14/ Resolving jenger.riken.jp (jenger.riken.jp)... 134.160.84.25 Connecting to jenger.riken.jp (jenger.riken.jp)|134.160.84.25|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 274187574 (261M) [text/plain] Saving to: ‘t2d_bbj.txt.gz’ t2d_bbj.txt.gz 100%[===================>] 261.49M 10.6MB/s in 25s 2022-10-20 22:22:36 (10.6 MB/s) - ‘t2d_bbj.txt.gz’ saved [274187574/274187574]
In [2]:
Copied!
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
ea="ALT",
nea="REF",
neaf="Frq",
beta="BETA",
se="SE",
p="P",
direction="Dir",
n="N")
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
snpid="SNP",
chrom="CHR",
pos="POS",
ea="ALT",
nea="REF",
neaf="Frq",
beta="BETA",
se="SE",
p="P",
direction="Dir",
n="N")
Fri Oct 21 01:15:35 2022 Start to initiate from file :t2d_bbj.txt.gz Fri Oct 21 01:15:52 2022 -Reading columns : ALT,CHR,SNP,N,Frq,Dir,REF,BETA,POS,SE,P Fri Oct 21 01:15:52 2022 -Renaming columns to : EA,CHR,SNPID,N,EAF,DIRECTION,NEA,BETA,POS,SE,P Fri Oct 21 01:15:52 2022 -Current dataframe shape : Rows 12557761 x 11 Columns Fri Oct 21 01:15:53 2022 -Initiating a status column ... Fri Oct 21 01:15:54 2022 -Reordering columns to : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS Fri Oct 21 01:15:55 2022 -NEAF is specified... Fri Oct 21 01:15:55 2022 -Checking if 0<= NEAF <=1 ... Fri Oct 21 01:15:57 2022 -Converted NEAF to EAF. Fri Oct 21 01:15:57 2022 -Removed 0 variants with bad NEAF. Fri Oct 21 01:15:57 2022 Finished loading data successfully!
In [4]:
Copied!
mysumstats.plot_mqq(skip=3,cut=20, mode="mqq",stratified=True)
mysumstats.plot_mqq(skip=3,cut=20, mode="mqq",stratified=True)
Thu Oct 20 21:13:27 2022 Start to plot manhattan/qq plot with the following basic settings: Thu Oct 20 21:13:27 2022 -Genome-wide significance level is set to 5e-08 ... Thu Oct 20 21:13:27 2022 -Raw input contains 12557761 variants... Thu Oct 20 21:13:27 2022 -Plot layout mode is : mqq Thu Oct 20 21:13:36 2022 Finished loading specified columns from the sumstats. Thu Oct 20 21:13:36 2022 Start conversion and sanity check: Thu Oct 20 21:13:38 2022 -Removed 328791 variants with nan in CHR or POS column ... Thu Oct 20 21:13:38 2022 -Removed 0 variants with nan in EAF column ... Thu Oct 20 21:13:39 2022 -Removed 0 variants with nan in P column ... Thu Oct 20 21:13:39 2022 -P values are being converted to -log10(P)... Thu Oct 20 21:13:39 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Thu Oct 20 21:13:41 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Thu Oct 20 21:13:43 2022 -Maximum -log10(P) values is 167.58838029403677 . Thu Oct 20 21:13:43 2022 -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10... Thu Oct 20 21:13:43 2022 Finished data conversion and sanity check. Thu Oct 20 21:13:43 2022 Start to create manhattan plot with 87543 variants: Thu Oct 20 21:13:44 2022 -Found 84 significant variants with a sliding window size of 500 kb... Thu Oct 20 21:13:44 2022 Finished creating Manhattan plot successfully! Thu Oct 20 21:13:44 2022 -Skip annotating Thu Oct 20 21:13:44 2022 Start to create QQ plot with 87543 variants: Thu Oct 20 21:13:45 2022 -Calculating GC using P : 1.2139048028292598 Thu Oct 20 21:13:45 2022 Finished creating QQ plot successfully!
Out[4]:
(<Figure size 1500x500 with 2 Axes>, <gwaslab.Log.Log at 0x7ff2b0428e20>)
In [3]:
Copied!
mysumstats.filter_in(eq={"CHR":"7"})
mysumstats.basic_check()
mysumstats.filter_in(eq={"CHR":"7"})
mysumstats.basic_check()
Fri Oct 21 01:15:57 2022 Start filtering values: Fri Oct 21 01:15:58 2022 -Keeping 707780 variants with CHR = 7 ... Fri Oct 21 01:15:58 2022 Finished filtering values. Fri Oct 21 01:15:59 2022 Start to check IDs... Fri Oct 21 01:15:59 2022 -Current Dataframe shape : 707780 x 12 Fri Oct 21 01:15:59 2022 -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _) Fri Oct 21 01:16:00 2022 Finished checking IDs successfully! Fri Oct 21 01:16:00 2022 Start to fix chromosome notation... Fri Oct 21 01:16:00 2022 -Current Dataframe shape : 707780 x 12 Fri Oct 21 01:16:01 2022 -All CHR are already fixed... Fri Oct 21 01:16:02 2022 Finished fixing chromosome notation successfully! Fri Oct 21 01:16:02 2022 Start to fix basepair positions... Fri Oct 21 01:16:02 2022 -Current Dataframe shape : 707780 x 12 Fri Oct 21 01:16:03 2022 -Position upper_bound is: 250,000,000 Fri Oct 21 01:16:04 2022 -Remove outliers: 0 Fri Oct 21 01:16:04 2022 -Converted all position to datatype Int64. Fri Oct 21 01:16:04 2022 Finished fixing basepair position successfully! Fri Oct 21 01:16:04 2022 Start to fix alleles... Fri Oct 21 01:16:04 2022 -Current Dataframe shape : 707780 x 12 Fri Oct 21 01:16:05 2022 -Converted all bases to string datatype and UPPERCASE. Fri Oct 21 01:16:05 2022 Finished fixing allele successfully! Fri Oct 21 01:16:05 2022 Start sanity check for statistics ... Fri Oct 21 01:16:05 2022 -Current Dataframe shape : 707780 x 12 Fri Oct 21 01:16:05 2022 -Checking if N is Int64 and N>0 ... Fri Oct 21 01:16:05 2022 -Removed 0 variants with bad N. Fri Oct 21 01:16:05 2022 -Checking if 0<= EAF <=1 ... Fri Oct 21 01:16:06 2022 -Removed 0 variants with bad EAF. Fri Oct 21 01:16:06 2022 -Checking if MAC >=5 ... Fri Oct 21 01:16:06 2022 -Removed 0 variants with bad MAC. Fri Oct 21 01:16:06 2022 -Checking if 0< P <5e-300 ... Fri Oct 21 01:16:06 2022 -Removed 0 variants with bad P. Fri Oct 21 01:16:06 2022 -Checking if abs(BETA)<10 ... Fri Oct 21 01:16:06 2022 -Removed 0 variants with bad BETA. Fri Oct 21 01:16:06 2022 -Checking if SE >0 ... Fri Oct 21 01:16:06 2022 -Removed 0 variants with bad SE. Fri Oct 21 01:16:06 2022 -Checking STATUS... Fri Oct 21 01:16:06 2022 -Coverting STAUTUS to category. Fri Oct 21 01:16:06 2022 -Removed 0 variants with bad statistics in total. Fri Oct 21 01:16:06 2022 Finished sanity check successfully! Fri Oct 21 01:16:06 2022 Start to normalize variants... Fri Oct 21 01:16:06 2022 -Current Dataframe shape : 707780 x 12 Fri Oct 21 01:16:07 2022 -No available variants to normalize.. Fri Oct 21 01:16:07 2022 Finished normalizing variants successfully!
In [7]:
Copied!
mysumstats.plot_mqq(mode="m",anno="GENENAME",anno_source="ensembl",build="19")
mysumstats.plot_mqq(mode="m",anno="GENENAME",anno_source="ensembl",build="19")
Thu Oct 20 21:14:02 2022 Start to plot manhattan/qq plot with the following basic settings: Thu Oct 20 21:14:02 2022 -Genome-wide significance level is set to 5e-08 ... Thu Oct 20 21:14:02 2022 -Raw input contains 707780 variants... Thu Oct 20 21:14:02 2022 -Plot layout mode is : m Thu Oct 20 21:14:02 2022 Finished loading specified columns from the sumstats. Thu Oct 20 21:14:02 2022 Start conversion and sanity check: Thu Oct 20 21:14:02 2022 -Removed 0 variants with nan in CHR or POS column ... Thu Oct 20 21:14:02 2022 -Removed 0 variants with nan in P column ... Thu Oct 20 21:14:02 2022 -P values are being converted to -log10(P)... Thu Oct 20 21:14:02 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Thu Oct 20 21:14:02 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Thu Oct 20 21:14:02 2022 -Maximum -log10(P) values is 73.38711023071251 . Thu Oct 20 21:14:02 2022 Finished data conversion and sanity check. Thu Oct 20 21:14:03 2022 Start to create manhattan plot with 707741 variants:
INFO:pyensembl.database:Creating database: /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Ensembl/release75/Homo_sapiens.GRCh37.75.protein_coding.gtf.db INFO:pyensembl.database:Reading GTF from /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Ensembl/release75/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz
Thu Oct 20 21:14:04 2022 -Found 7 significant variants with a sliding window size of 500 kb... Thu Oct 20 21:14:04 2022 Start to annotate variants with nearest gene name(s)... Thu Oct 20 21:14:04 2022 -Assigning Gene name using built-in Ensembl Release 75 (hg19)
/home/yunye/anaconda3/lib/python3.9/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future. chunk_iterator = pd.read_csv( /home/yunye/anaconda3/lib/python3.9/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The warn_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future. chunk_iterator = pd.read_csv( INFO:root:Extracted GTF attributes: ['gene_id', 'transcript_id', 'gene_name', 'gene_biotype', 'transcript_name', 'exon_number', 'exon_id', 'ccds_id', 'protein_id'] INFO:root:Using column 'source' to replace missing 'transcript_biotype' INFO:datacache.database_helpers:Creating database /home/yunye/gwaslab/gwaslab/src/gwaslab/data/Ensembl/release75/Homo_sapiens.GRCh37.75.protein_coding.gtf.db containing: stop_codon, start_codon, transcript, CDS, gene, exon INFO:datacache.database:Running sqlite query: "CREATE TABLE stop_codon (transcript_name TEXT NOT NULL, end INT NOT NULL, exon_id TEXT NOT NULL, seqname TEXT NOT NULL, feature TEXT NOT NULL, source TEXT NOT NULL, ccds_id TEXT NOT NULL, transcript_id TEXT NOT NULL, start INT NOT NULL, strand TEXT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, transcript_biotype TEXT NOT NULL, gene_name TEXT NOT NULL, gene_biotype TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 73352 rows into table stop_codon INFO:datacache.database:Creating index on stop_codon (seqname, start, end) INFO:datacache.database:Creating index on stop_codon (gene_name) INFO:datacache.database:Creating index on stop_codon (gene_id) INFO:datacache.database:Creating index on stop_codon (transcript_id) INFO:datacache.database:Creating index on stop_codon (transcript_name) INFO:datacache.database:Creating index on stop_codon (exon_id) INFO:datacache.database:Creating index on stop_codon (protein_id) INFO:datacache.database:Creating index on stop_codon (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE start_codon (transcript_name TEXT NOT NULL, end INT NOT NULL, exon_id TEXT NOT NULL, seqname TEXT NOT NULL, feature TEXT NOT NULL, source TEXT NOT NULL, ccds_id TEXT NOT NULL, transcript_id TEXT NOT NULL, start INT NOT NULL, strand TEXT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, transcript_biotype TEXT NOT NULL, gene_name TEXT NOT NULL, gene_biotype TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 73293 rows into table start_codon INFO:datacache.database:Creating index on start_codon (seqname, start, end) INFO:datacache.database:Creating index on start_codon (gene_name) INFO:datacache.database:Creating index on start_codon (gene_id) INFO:datacache.database:Creating index on start_codon (transcript_id) INFO:datacache.database:Creating index on start_codon (transcript_name) INFO:datacache.database:Creating index on start_codon (exon_id) INFO:datacache.database:Creating index on start_codon (protein_id) INFO:datacache.database:Creating index on start_codon (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE transcript (transcript_name TEXT NOT NULL, end INT NOT NULL, exon_id TEXT NOT NULL, seqname TEXT NOT NULL, feature TEXT NOT NULL, source TEXT NOT NULL, ccds_id TEXT NOT NULL, transcript_id TEXT UNIQUE PRIMARY KEY NOT NULL, start INT NOT NULL, strand TEXT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, transcript_biotype TEXT NOT NULL, gene_name TEXT NOT NULL, gene_biotype TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 165581 rows into table transcript INFO:datacache.database:Creating index on transcript (seqname, start, end) INFO:datacache.database:Creating index on transcript (gene_name) INFO:datacache.database:Creating index on transcript (gene_id) INFO:datacache.database:Creating index on transcript (transcript_name) INFO:datacache.database:Creating index on transcript (exon_id) INFO:datacache.database:Creating index on transcript (protein_id) INFO:datacache.database:Creating index on transcript (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE CDS (transcript_name TEXT NOT NULL, end INT NOT NULL, exon_id TEXT NOT NULL, seqname TEXT NOT NULL, feature TEXT NOT NULL, source TEXT NOT NULL, ccds_id TEXT NOT NULL, transcript_id TEXT NOT NULL, start INT NOT NULL, strand TEXT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, transcript_biotype TEXT NOT NULL, gene_name TEXT NOT NULL, gene_biotype TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 791057 rows into table CDS INFO:datacache.database:Creating index on CDS (seqname, start, end) INFO:datacache.database:Creating index on CDS (gene_name) INFO:datacache.database:Creating index on CDS (gene_id) INFO:datacache.database:Creating index on CDS (transcript_id) INFO:datacache.database:Creating index on CDS (transcript_name) INFO:datacache.database:Creating index on CDS (exon_id) INFO:datacache.database:Creating index on CDS (protein_id) INFO:datacache.database:Creating index on CDS (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE gene (transcript_name TEXT NOT NULL, end INT NOT NULL, exon_id TEXT NOT NULL, seqname TEXT NOT NULL, feature TEXT NOT NULL, source TEXT NOT NULL, ccds_id TEXT NOT NULL, transcript_id TEXT NOT NULL, start INT NOT NULL, strand TEXT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT UNIQUE PRIMARY KEY NOT NULL, transcript_biotype TEXT NOT NULL, gene_name TEXT NOT NULL, gene_biotype TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 24159 rows into table gene INFO:datacache.database:Creating index on gene (seqname, start, end) INFO:datacache.database:Creating index on gene (gene_name) INFO:datacache.database:Creating index on gene (transcript_id) INFO:datacache.database:Creating index on gene (transcript_name) INFO:datacache.database:Creating index on gene (exon_id) INFO:datacache.database:Creating index on gene (protein_id) INFO:datacache.database:Creating index on gene (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE exon (transcript_name TEXT NOT NULL, end INT NOT NULL, exon_id TEXT NOT NULL, seqname TEXT NOT NULL, feature TEXT NOT NULL, source TEXT NOT NULL, ccds_id TEXT NOT NULL, transcript_id TEXT NOT NULL, start INT NOT NULL, strand TEXT NOT NULL, exon_number TEXT NOT NULL, gene_id TEXT NOT NULL, transcript_biotype TEXT NOT NULL, gene_name TEXT NOT NULL, gene_biotype TEXT NOT NULL, protein_id TEXT NOT NULL)" INFO:datacache.database:Inserting 1195297 rows into table exon INFO:datacache.database:Creating index on exon (seqname, start, end) INFO:datacache.database:Creating index on exon (gene_name) INFO:datacache.database:Creating index on exon (gene_id) INFO:datacache.database:Creating index on exon (transcript_id) INFO:datacache.database:Creating index on exon (transcript_name) INFO:datacache.database:Creating index on exon (exon_id) INFO:datacache.database:Creating index on exon (protein_id) INFO:datacache.database:Creating index on exon (ccds_id) INFO:datacache.database:Running sqlite query: "CREATE TABLE _datacache_metadata (version INT)" INFO:datacache.database:Running sqlite query: "INSERT INTO _datacache_metadata VALUES (3)"
Thu Oct 20 21:15:04 2022 Finished annotating variants with nearest gene name(s) successfully! Thu Oct 20 21:15:04 2022 Finished creating Manhattan plot successfully! Thu Oct 20 21:15:04 2022 -Annotating using column GENENAME...
Out[7]:
(<Figure size 1500x500 with 1 Axes>, <gwaslab.Log.Log at 0x7ff2b0428e20>)
In [8]:
Copied!
mysumstats.get_lead()
mysumstats.get_lead()
Thu Oct 20 21:15:29 2022 Start to extract lead variants... Thu Oct 20 21:15:29 2022 -Processing 707780 variants... Thu Oct 20 21:15:29 2022 -Significance threshold : 5e-08 Thu Oct 20 21:15:29 2022 -Sliding window size: 500 kb Thu Oct 20 21:15:29 2022 -Found 1077 significant variants in total... Thu Oct 20 21:15:30 2022 -Identified 7 lead variants! Thu Oct 20 21:15:30 2022 Finished extracting lead variants successfully!
Out[8]:
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5474489 | 7:13888699_G_C | 7 | 13888699 | G | C | 0.5680 | 0.0562 | 0.0094 | 2.507000e-09 | 191764 | ++++ | 9960099 |
| 5480067 | 7:14898282_C_T | 7 | 14898282 | C | T | 0.6012 | 0.0617 | 0.0088 | 2.336000e-12 | 191764 | ++++ | 9960099 |
| 5626346 | 7:44174857_T_G | 7 | 44174857 | G | T | 0.5985 | -0.0640 | 0.0093 | 5.325000e-12 | 191764 | ---- | 9960099 |
| 5732279 | 7:69406661_A_T | 7 | 69406661 | T | A | 0.1981 | -0.0900 | 0.0111 | 4.871000e-16 | 191764 | ---- | 9960099 |
| 5965364 | 7:127253550_C_T | 7 | 127253550 | C | T | 0.9081 | 0.2761 | 0.0152 | 4.101000e-74 | 191764 | ++++ | 9960099 |
| 5976830 | 7:130025713_G_A | 7 | 130025713 | G | A | 0.9530 | -0.1365 | 0.0230 | 3.068000e-09 | 191764 | ---- | 9960099 |
| 6092347 | 7:157038803_A_G | 7 | 157038803 | G | A | 0.4626 | -0.0502 | 0.0088 | 1.127000e-08 | 191764 | ---- | 9960099 |
In [4]:
Copied!
mysumstats.get_lead(anno=True)
mysumstats.get_lead(anno=True)
Thu Oct 20 22:41:42 2022 Start to extract lead variants... Thu Oct 20 22:41:42 2022 -Processing 707780 variants... Thu Oct 20 22:41:42 2022 -Significance threshold : 5e-08 Thu Oct 20 22:41:42 2022 -Sliding window size: 500 kb Thu Oct 20 22:41:42 2022 -Found 1077 significant variants in total... Thu Oct 20 22:41:42 2022 -Identified 7 lead variants! Thu Oct 20 22:41:42 2022 Start to annotate variants with nearest gene name(s)... Thu Oct 20 22:41:42 2022 -Assigning Gene name using built-in Ensembl Release 75 (hg19) Thu Oct 20 22:41:44 2022 Finished annotating variants with nearest gene name(s) successfully! Thu Oct 20 22:41:44 2022 Finished extracting lead variants successfully!
Out[4]:
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | LOCATION | GENE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5474489 | 7:13888699_G_C | 7 | 13888699 | G | C | 0.5680 | 0.0562 | 0.0094 | 2.507000e-09 | 191764 | ++++ | 9960099 | 42154 | ETV1 |
| 5480067 | 7:14898282_C_T | 7 | 14898282 | C | T | 0.6012 | 0.0617 | 0.0088 | 2.336000e-12 | 191764 | ++++ | 9960099 | 0 | DGKB |
| 5626346 | 7:44174857_T_G | 7 | 44174857 | G | T | 0.5985 | -0.0640 | 0.0093 | 5.325000e-12 | 191764 | ---- | 9960099 | 3606 | MYL7 |
| 5732279 | 7:69406661_A_T | 7 | 69406661 | T | A | 0.1981 | -0.0900 | 0.0111 | 4.871000e-16 | 191764 | ---- | 9960099 | 0 | AUTS2 |
| 5965364 | 7:127253550_C_T | 7 | 127253550 | C | T | 0.9081 | 0.2761 | 0.0152 | 4.101000e-74 | 191764 | ++++ | 9960099 | 0 | PAX4 |
| 5976830 | 7:130025713_G_A | 7 | 130025713 | G | A | 0.9530 | -0.1365 | 0.0230 | 3.068000e-09 | 191764 | ---- | 9960099 | 0 | CPA1 |
| 6092347 | 7:157038803_A_G | 7 | 157038803 | G | A | 0.4626 | -0.0502 | 0.0088 | 1.127000e-08 | 191764 | ---- | 9960099 | 0 | UBE3C |
In [7]:
Copied!
mysumstats.plot_mqq(region=(7,156538803,157538803))
mysumstats.plot_mqq(region=(7,156538803,157538803))
Wed Sep 21 14:47:58 2022 Start to plot manhattan/qq plot with the following basic settings: Wed Sep 21 14:47:58 2022 -Genome-wide significance level is set to 5e-08 ... Wed Sep 21 14:47:58 2022 -Raw input contains 707780 variants... Wed Sep 21 14:47:58 2022 -Plot layout mode is : mqq Wed Sep 21 14:47:58 2022 -Region to plot : chr7:156538803-157538803. Wed Sep 21 14:47:59 2022 -Extract SNPs in region : chr7:156538803-157538803... Wed Sep 21 14:48:01 2022 -Extract SNPs in specified regions: 5831 Wed Sep 21 14:48:01 2022 Finished loading specified columns from the sumstats. Wed Sep 21 14:48:01 2022 Start conversion and sanity check: Wed Sep 21 14:48:01 2022 -Removed 0 variants with nan in CHR or POS column ... Wed Sep 21 14:48:01 2022 -Removed 0 variants with nan in P column ... Wed Sep 21 14:48:01 2022 -P values are being converted to -log10(P)... Wed Sep 21 14:48:01 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Wed Sep 21 14:48:01 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Wed Sep 21 14:48:01 2022 -Maximum -log10(P) values is 7.948076083953893 . Wed Sep 21 14:48:01 2022 Finished data conversion and sanity check. Wed Sep 21 14:48:01 2022 Start to create manhattan plot with 5831 variants: Wed Sep 21 14:48:02 2022 -Found 1 significant variants with a sliding window size of 500 kb... Wed Sep 21 14:48:02 2022 -Skip annotating Wed Sep 21 14:48:02 2022 Finished creating Manhattan plot successfully! Wed Sep 21 14:48:02 2022 Start to create QQ plot with 5831 variants: Wed Sep 21 14:48:02 2022 -Calculating GC using P : 1.7191388184129959 Wed Sep 21 14:48:02 2022 Finished creating QQ plot successfully!
Out[7]:
(<Figure size 1500x500 with 3 Axes>, <gwaslab.Log.Log at 0x7fc6dd3f0220>)
In [10]:
Copied!
mysumstats.plot_mqq(mode="r", region=(7,156538803,157538803),region_grid=True, gtf_path="ensembl")
mysumstats.plot_mqq(mode="r", region=(7,156538803,157538803),region_grid=True, gtf_path="ensembl")
Thu Oct 20 21:16:14 2022 Start to plot manhattan/qq plot with the following basic settings: Thu Oct 20 21:16:14 2022 -Genome-wide significance level is set to 5e-08 ... Thu Oct 20 21:16:14 2022 -Raw input contains 707780 variants... Thu Oct 20 21:16:14 2022 -Plot layout mode is : r Thu Oct 20 21:16:14 2022 -Region to plot : chr7:156538803-157538803. Thu Oct 20 21:16:14 2022 -Extract SNPs in region : chr7:156538803-157538803... Thu Oct 20 21:16:16 2022 -Extract SNPs in specified regions: 5831 Thu Oct 20 21:16:16 2022 Finished loading specified columns from the sumstats. Thu Oct 20 21:16:16 2022 Start conversion and sanity check: Thu Oct 20 21:16:16 2022 -Removed 0 variants with nan in P column ... Thu Oct 20 21:16:16 2022 -P values are being converted to -log10(P)... Thu Oct 20 21:16:16 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Thu Oct 20 21:16:16 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Thu Oct 20 21:16:16 2022 -Maximum -log10(P) values is 7.948076083953893 . Thu Oct 20 21:16:16 2022 Finished data conversion and sanity check. Thu Oct 20 21:16:16 2022 Start to create manhattan plot with 5831 variants: Thu Oct 20 21:16:16 2022 -Loading gtf files from:ensembl Thu Oct 20 21:16:24 2022 -plotting gene track.. Thu Oct 20 21:16:24 2022 -Finished plotting gene track.. Thu Oct 20 21:16:25 2022 -Found 1 significant variants with a sliding window size of 500 kb... Thu Oct 20 21:16:25 2022 Finished creating Manhattan plot successfully! Thu Oct 20 21:16:25 2022 -Skip annotating
Out[10]:
(<Figure size 1500x1000 with 3 Axes>, <gwaslab.Log.Log at 0x7ff2b0428e20>)
In [4]:
Copied!
lead_variants = mysumstats.get_lead()
lead_variants
lead_variants = mysumstats.get_lead()
lead_variants
Fri Oct 21 01:16:07 2022 Start to extract lead variants... Fri Oct 21 01:16:07 2022 -Processing 707780 variants... Fri Oct 21 01:16:07 2022 -Significance threshold : 5e-08 Fri Oct 21 01:16:07 2022 -Sliding window size: 500 kb Fri Oct 21 01:16:07 2022 -Found 1077 significant variants in total... Fri Oct 21 01:16:07 2022 -Identified 7 lead variants! Fri Oct 21 01:16:07 2022 Finished extracting lead variants successfully!
Out[4]:
| SNPID | CHR | POS | EA | NEA | EAF | BETA | SE | P | N | DIRECTION | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5474489 | 7:13888699_G_C | 7 | 13888699 | G | C | 0.5680 | 0.0562 | 0.0094 | 2.507000e-09 | 191764 | ++++ | 9960099 |
| 5480067 | 7:14898282_C_T | 7 | 14898282 | C | T | 0.6012 | 0.0617 | 0.0088 | 2.336000e-12 | 191764 | ++++ | 9960099 |
| 5626346 | 7:44174857_T_G | 7 | 44174857 | G | T | 0.5985 | -0.0640 | 0.0093 | 5.325000e-12 | 191764 | ---- | 9960099 |
| 5732279 | 7:69406661_A_T | 7 | 69406661 | T | A | 0.1981 | -0.0900 | 0.0111 | 4.871000e-16 | 191764 | ---- | 9960099 |
| 5965364 | 7:127253550_C_T | 7 | 127253550 | C | T | 0.9081 | 0.2761 | 0.0152 | 4.101000e-74 | 191764 | ++++ | 9960099 |
| 5976830 | 7:130025713_G_A | 7 | 130025713 | G | A | 0.9530 | -0.1365 | 0.0230 | 3.068000e-09 | 191764 | ---- | 9960099 |
| 6092347 | 7:157038803_A_G | 7 | 157038803 | G | A | 0.4626 | -0.0502 | 0.0088 | 1.127000e-08 | 191764 | ---- | 9960099 |
In [8]:
Copied!
flank = 100000
for index,row in lead_variants.iterrows():
mysumstats.plot_mqq(mode="r",region=(row["CHR"],row["POS"]-flank,row["POS"]+flank),region_grid=True,mqqratio=2,taf=[4,0,0.95,1,1],figargs={"figsize":(15,5),"dpi":300})
#vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.chr7.split_norm_af.vcf.gz")
flank = 100000
for index,row in lead_variants.iterrows():
mysumstats.plot_mqq(mode="r",region=(row["CHR"],row["POS"]-flank,row["POS"]+flank),region_grid=True,mqqratio=2,taf=[4,0,0.95,1,1],figargs={"figsize":(15,5),"dpi":300})
#vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.chr7.split_norm_af.vcf.gz")
Fri Oct 21 01:20:36 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:20:36 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:20:36 2022 -Raw input contains 707780 variants... Fri Oct 21 01:20:36 2022 -Plot layout mode is : r Fri Oct 21 01:20:36 2022 -Region to plot : chr7:13788699-13988699. Fri Oct 21 01:20:36 2022 -Extract SNPs in region : chr7:13788699-13988699... Fri Oct 21 01:20:37 2022 -Extract SNPs in specified regions: 1225 Fri Oct 21 01:20:37 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:20:37 2022 Start conversion and sanity check: Fri Oct 21 01:20:37 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:20:37 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:20:37 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:20:37 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:20:37 2022 -Maximum -log10(P) values is 8.600845666041783 . Fri Oct 21 01:20:37 2022 Finished data conversion and sanity check. Fri Oct 21 01:20:37 2022 Start to create manhattan plot with 1224 variants: Fri Oct 21 01:20:37 2022 -Loading gtf files from:defualt Fri Oct 21 01:20:46 2022 -plotting gene track.. Fri Oct 21 01:20:46 2022 -Finished plotting gene track.. Fri Oct 21 01:20:46 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:20:46 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:20:46 2022 -Skip annotating Fri Oct 21 01:20:46 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:20:46 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:20:46 2022 -Raw input contains 707780 variants... Fri Oct 21 01:20:46 2022 -Plot layout mode is : r Fri Oct 21 01:20:46 2022 -Region to plot : chr7:14798282-14998282. Fri Oct 21 01:20:46 2022 -Extract SNPs in region : chr7:14798282-14998282... Fri Oct 21 01:20:47 2022 -Extract SNPs in specified regions: 993 Fri Oct 21 01:20:47 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:20:47 2022 Start conversion and sanity check: Fri Oct 21 01:20:47 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:20:47 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:20:47 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:20:47 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:20:47 2022 -Maximum -log10(P) values is 11.631527161559639 . Fri Oct 21 01:20:47 2022 Finished data conversion and sanity check. Fri Oct 21 01:20:47 2022 Start to create manhattan plot with 993 variants: Fri Oct 21 01:20:47 2022 -Loading gtf files from:defualt Fri Oct 21 01:20:56 2022 -plotting gene track.. Fri Oct 21 01:20:56 2022 -Finished plotting gene track.. Fri Oct 21 01:20:56 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:20:56 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:20:56 2022 -Skip annotating Fri Oct 21 01:20:56 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:20:56 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:20:56 2022 -Raw input contains 707780 variants... Fri Oct 21 01:20:56 2022 -Plot layout mode is : r Fri Oct 21 01:20:56 2022 -Region to plot : chr7:44074857-44274857. Fri Oct 21 01:20:56 2022 -Extract SNPs in region : chr7:44074857-44274857... Fri Oct 21 01:20:58 2022 -Extract SNPs in specified regions: 929 Fri Oct 21 01:20:58 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:20:58 2022 Start conversion and sanity check: Fri Oct 21 01:20:58 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:20:58 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:20:58 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:20:58 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:20:58 2022 -Maximum -log10(P) values is 11.273680387889225 . Fri Oct 21 01:20:58 2022 Finished data conversion and sanity check. Fri Oct 21 01:20:58 2022 Start to create manhattan plot with 929 variants: Fri Oct 21 01:20:58 2022 -Loading gtf files from:defualt Fri Oct 21 01:21:07 2022 -plotting gene track.. Fri Oct 21 01:21:07 2022 -Finished plotting gene track.. Fri Oct 21 01:21:07 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:21:07 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:21:07 2022 -Skip annotating Fri Oct 21 01:21:07 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:21:07 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:21:07 2022 -Raw input contains 707780 variants... Fri Oct 21 01:21:07 2022 -Plot layout mode is : r Fri Oct 21 01:21:07 2022 -Region to plot : chr7:69306661-69506661. Fri Oct 21 01:21:07 2022 -Extract SNPs in region : chr7:69306661-69506661... Fri Oct 21 01:21:09 2022 -Extract SNPs in specified regions: 501 Fri Oct 21 01:21:09 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:21:09 2022 Start conversion and sanity check: Fri Oct 21 01:21:09 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:21:09 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:21:09 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:21:09 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:21:09 2022 -Maximum -log10(P) values is 15.31238187042823 . Fri Oct 21 01:21:09 2022 Finished data conversion and sanity check. Fri Oct 21 01:21:09 2022 Start to create manhattan plot with 501 variants: Fri Oct 21 01:21:09 2022 -Loading gtf files from:defualt Fri Oct 21 01:21:18 2022 -plotting gene track.. Fri Oct 21 01:21:18 2022 -Finished plotting gene track.. Fri Oct 21 01:21:18 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:21:18 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:21:18 2022 -Skip annotating Fri Oct 21 01:21:18 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:21:18 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:21:18 2022 -Raw input contains 707780 variants... Fri Oct 21 01:21:18 2022 -Plot layout mode is : r Fri Oct 21 01:21:18 2022 -Region to plot : chr7:127153550-127353550. Fri Oct 21 01:21:18 2022 -Extract SNPs in region : chr7:127153550-127353550... Fri Oct 21 01:21:19 2022 -Extract SNPs in specified regions: 713 Fri Oct 21 01:21:19 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:21:19 2022 Start conversion and sanity check: Fri Oct 21 01:21:19 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:21:19 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:21:19 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:21:19 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:21:19 2022 -Maximum -log10(P) values is 73.38711023071251 . Fri Oct 21 01:21:19 2022 Finished data conversion and sanity check. Fri Oct 21 01:21:19 2022 Start to create manhattan plot with 713 variants: Fri Oct 21 01:21:19 2022 -Loading gtf files from:defualt Fri Oct 21 01:21:28 2022 -plotting gene track.. Fri Oct 21 01:21:28 2022 -Finished plotting gene track.. Fri Oct 21 01:21:28 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:21:28 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:21:28 2022 -Skip annotating Fri Oct 21 01:21:28 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:21:28 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:21:28 2022 -Raw input contains 707780 variants... Fri Oct 21 01:21:28 2022 -Plot layout mode is : r Fri Oct 21 01:21:28 2022 -Region to plot : chr7:129925713-130125713. Fri Oct 21 01:21:28 2022 -Extract SNPs in region : chr7:129925713-130125713... Fri Oct 21 01:21:30 2022 -Extract SNPs in specified regions: 808 Fri Oct 21 01:21:30 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:21:30 2022 Start conversion and sanity check: Fri Oct 21 01:21:30 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:21:30 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:21:30 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:21:30 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:21:30 2022 -Maximum -log10(P) values is 8.513144644723056 . Fri Oct 21 01:21:30 2022 Finished data conversion and sanity check. Fri Oct 21 01:21:30 2022 Start to create manhattan plot with 808 variants: Fri Oct 21 01:21:30 2022 -Loading gtf files from:defualt Fri Oct 21 01:21:39 2022 -plotting gene track.. Fri Oct 21 01:21:39 2022 -Finished plotting gene track.. Fri Oct 21 01:21:39 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:21:39 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:21:39 2022 -Skip annotating Fri Oct 21 01:21:39 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:21:39 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:21:39 2022 -Raw input contains 707780 variants... Fri Oct 21 01:21:39 2022 -Plot layout mode is : r Fri Oct 21 01:21:39 2022 -Region to plot : chr7:156938803-157138803. Fri Oct 21 01:21:39 2022 -Extract SNPs in region : chr7:156938803-157138803... Fri Oct 21 01:21:40 2022 -Extract SNPs in specified regions: 1121 Fri Oct 21 01:21:40 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:21:40 2022 Start conversion and sanity check: Fri Oct 21 01:21:40 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:21:40 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:21:40 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:21:40 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:21:40 2022 -Maximum -log10(P) values is 7.948076083953893 . Fri Oct 21 01:21:40 2022 Finished data conversion and sanity check. Fri Oct 21 01:21:40 2022 Start to create manhattan plot with 1121 variants: Fri Oct 21 01:21:41 2022 -Loading gtf files from:defualt Fri Oct 21 01:21:50 2022 -plotting gene track.. Fri Oct 21 01:21:50 2022 -Finished plotting gene track.. Fri Oct 21 01:21:50 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:21:50 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:21:50 2022 -Skip annotating
In [9]:
Copied!
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,
vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.chr7.split_norm_af.vcf.gz",taf=[4,0,0.95,1,1])
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,
vcf_path="/home/yunye/mydata/d_disk/eas_1kg_af/EAS.chr7.split_norm_af.vcf.gz",taf=[4,0,0.95,1,1])
Fri Oct 21 01:24:42 2022 Start to plot manhattan/qq plot with the following basic settings: Fri Oct 21 01:24:42 2022 -Genome-wide significance level is set to 5e-08 ... Fri Oct 21 01:24:42 2022 -Raw input contains 707780 variants... Fri Oct 21 01:24:42 2022 -Plot layout mode is : r Fri Oct 21 01:24:42 2022 -Region to plot : chr7:156538803-157538803. Fri Oct 21 01:24:42 2022 -Extract SNPs in region : chr7:156538803-157538803... Fri Oct 21 01:24:43 2022 -Extract SNPs in specified regions: 5831 Fri Oct 21 01:24:43 2022 Finished loading specified columns from the sumstats. Fri Oct 21 01:24:43 2022 Start conversion and sanity check: Fri Oct 21 01:24:43 2022 -Removed 0 variants with nan in P column ... Fri Oct 21 01:24:43 2022 -P values are being converted to -log10(P)... Fri Oct 21 01:24:43 2022 -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed... Fri Oct 21 01:24:43 2022 -Sanity check: 0 na/inf/-inf variants will be removed... Fri Oct 21 01:24:43 2022 -Maximum -log10(P) values is 7.948076083953893 . Fri Oct 21 01:24:43 2022 Finished data conversion and sanity check. Fri Oct 21 01:24:43 2022 Start to load reference genotype... Fri Oct 21 01:24:43 2022 -reference vcf path : /home/yunye/mydata/d_disk/eas_1kg_af/EAS.chr7.split_norm_af.vcf.gz Fri Oct 21 01:25:26 2022 -Retrieving index... 16384 Fri Oct 21 01:25:26 2022 -Calculating Rsq... Fri Oct 21 01:25:26 2022 Finished loading reference genotype successfully! Fri Oct 21 01:25:26 2022 Start to create manhattan plot with 5831 variants: Fri Oct 21 01:25:26 2022 -Loading gtf files from:defualt Fri Oct 21 01:25:35 2022 -plotting gene track.. Fri Oct 21 01:25:35 2022 -Finished plotting gene track.. Fri Oct 21 01:25:35 2022 -Found 1 significant variants with a sliding window size of 500 kb... Fri Oct 21 01:25:35 2022 Finished creating Manhattan plot successfully! Fri Oct 21 01:25:35 2022 -Skip annotating
Out[9]:
(<Figure size 1500x1000 with 4 Axes>, <gwaslab.Log.Log at 0x7fb2af198040>)
In [ ]:
Copied!